#### Replication Materials for: Denny and Spirling (2017) "Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It" ####

* If you have any questions or problems with these replication materials, please feel free to contact Matt Denny at matthewjdenny@gmail.com

* Note that these replication materials were generated using the following package versions (as of 10-19-17):
-- ggplot2 2.2.1
-- quanteda 0.99.12
-- preText 0.6.0
-- lda 1.4.2
-- R 3.4.1
In the course of replication, we found that results (particularly Wordfish scores) can differ slightly when using much older versions of quanteda (from 2015). Either use the package versions noted here, or expect slightly different results with newer or older versions of these packages. Any such differences will be due to minor bug fixes. In general, we suggest using the latest versions of all packages, as these will have the most up-to-date functionality and the most current bug fixes.
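
As a convenience, the installed versions can be compared against the ones listed above before running any script. The following is only a minimal sketch: the helper name `check_versions` is hypothetical and not part of the replication scripts, and it reports differences rather than enforcing the listed versions.

```r
# Versions listed in this README (R itself is reported separately below).
expected <- c(ggplot2 = "2.2.1", quanteda = "0.99.12",
              preText = "0.6.0", lda = "1.4.2")

# Hypothetical helper: returns one status string per package.
check_versions <- function(expected) {
  vapply(names(expected), function(pkg) {
    # system.file() returns "" when the package is not installed.
    if (!nzchar(system.file(package = pkg))) {
      return("not installed")
    }
    installed <- as.character(utils::packageVersion(pkg))
    if (installed == expected[[pkg]]) {
      "matches listed version"
    } else {
      paste0("installed ", installed, ", listed ", expected[[pkg]])
    }
  }, character(1))
}

print(check_versions(expected))
message("R: running ", as.character(getRversion()), ", listed 3.4.1")
```

A status other than "matches listed version" does not mean the scripts will fail, only that small numerical differences of the kind described above are possible.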

This repo contains the data and scripts necessary to replicate all analyses and figures presented in the main text of the paper. Note that some of these scripts, particularly the LDA perplexity analysis, require significant computational resources to replicate (high RAM, many cores). Below, we provide a brief overview of the files contained in this repo:

* Scripts:

wordfish_analysis.R -- This script replicates the Wordfish analysis of the UK Manifestos corpus presented in Section 5.1 (Figure 1).

lda_analysis.R -- This script replicates the LDA perplexity and key terms analyses in Section 5.2 (Figures 2 and 3).

preprocess_data_and_generate_pretext_results.R -- This script replicates the preText regression analysis results in Section 6.1 (Figures 4 and 5).

wordfish_model_averaging.R -- This script replicates the Wordfish model averaging results presented in Section 6.2 (Figure 6).

wordfish_rank_plot_apriori_ordering.R -- This file defines a function used to create the plot on lines 67-75 of wordfish_analysis.R.

* ./Data/

128_Combination_Preprocessing_Labels.RData -- This .RData file contains vectors storing the clean names for each preprocessing combination, and the number of steps each represents. These are used throughout the various scripts to make nicer-looking output.

* The rest of these files contain the raw documents we used for our analyses, for each of the eight corpora we consider in the main body of the paper. The documents for each corpus are stored in a character vector, one per .RData file:

Death_Row_Statements.RData
House_Bills_113.RData
Indian_Treaties.RData
NYT_Articles.RData
Press_Releases.RData
SOTU_Speeches.RData
Trump_Campaign_Tweets.RData
UK_Manifestos.RData
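
To inspect one of these files before running the scripts, it can be loaded into a separate environment so that nothing in the workspace is overwritten. This is only a sketch: it assumes the working directory is the root of this repo, the helper name `load_rdata` is hypothetical, and the name of the object stored in each file is discovered at load time rather than assumed.

```r
# Hypothetical helper: load an .RData file into its own environment and
# return the restored objects as a named list.
load_rdata <- function(path) {
  env <- new.env()
  loaded <- load(path, envir = env)  # load() returns the restored object names
  mget(loaded, envir = env)
}

# Assumes the working directory is the repo root.
path <- file.path("Data", "UK_Manifestos.RData")
if (file.exists(path)) {
  objs <- load_rdata(path)
  print(names(objs))        # object name(s) stored in the file
  str(objs, max.level = 1)  # expected per this README: a character vector
}
```

The same call should work for any of the files listed above by changing the file name in `path`.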